Abstract

A predecessor (successor) search finds the largest element $x^-$ smaller than the input
string $x$ (the smallest element $x^+$ larger than or equal to $x$, respectively) out of a given
set $S$; in this paper, we consider the static case (i.e., $S$ is fixed and does not change
over time) and assume that the $n$ elements of $S$ are available for inspection. We present
a number of algorithms that, with a small additional index (usually of $O(n \log w)$ bits,
where $w$ is the string length), can answer predecessor/successor queries quickly and with
time bounds that depend on different kinds of distance, significantly improving several
results that appeared in the recent literature. Intuitively, our first result has a running
time that depends on the distance between $x$ and $x^\pm$: it is especially efficient when the
input $x$ is either very close to or very far from $x^-$ or $x^+$; our second result depends on
some global notion of distance in the set $S$, and is fast when the elements of $S$ are more
or less equally spaced in the universe; finally, for our third result we rely on a finger
(i.e., an element of $S$) to improve upon the first one; its running time depends on the
distance between the input and the finger.

1 Introduction 

In this paper we study the predecessor problem on a static set $S$ of binary strings of length
$w$. It is known from recent results [11, 12] that structures à la van Emde Boas (e.g., y-fast
tries [13]) using time $O(\log w)$ are optimal among those using linear space. The lower
bound proved in [11, 12] actually has several cases; another one is realised, for instance, in
exponential trees [1]. A very comprehensive discussion of the literature can be found in Mihai
Pătraşcu's thesis [10].

Although the match between upper and lower bounds settles the problem in the worst
case, there is a lot of room for improvement in two directions: first of all, if access to the
original set $S$ (as a sorted array) is available, it is in principle possible to devise an index
using sublinear additional space and still answer predecessor queries in optimal time; second,
one might try to improve upon the lower bound by making access time dependent on the
structure of $S$ or on some property relating the query string $x$ to the set $S$.

In this paper, we describe sublinear indices that provide significant improvements over
previous bounds.¹ Given a set $S$, we denote with $x^-$ and $x^+$ the predecessor and successor in
$S$ of a query string $x$, and let $d(x,S) = \min\{x^+ - x,\, x - x^-\}$ and
$D(x,S) = \max\{x^+ - x,\, x - x^-\}$. Note that $d(x,S)$ is small when $x$ is close to some
element of $S$, whereas $w - \log D(x,S)$ is small when $x$ is far from at least one of $x^\pm$.
Finally, let $\Delta_M$ and $\Delta_m$ be the maximum and minimum distance, respectively,
between two consecutive elements of $S$.

¹ Our space bounds are always given in terms of the additional number of bits besides those that are
necessary to store $S$.

1. We match the static worst-case search time $O(\log\log d(x,S))$ of [7], which was obtained
using space $O(nw \log\log w)$, but our index requires just $O(n \log w)$ additional space
(and thus overall linear space).

2. We improve exponentially over interval-biased search trees [6], answering predecessor
queries in time² $O(\log(w - \log D(x,S)))$, again using just $O(n \log w)$ additional bits of
space.

3. We improve exponentially over interpolation search [8], answering predecessor queries
in time $O(\log\log(\Delta_M/\Delta_m))$, always using just $O(n \log w)$ additional bits of space.

4. Finally, with slightly more (but still sublinear) space we can exploit a finger $y \in S$ to
speed up our second result to $O(\log(\log|x - y| - \log D(x,S)))$, which is in some cases
better than the bound reported in [1], and improves exponentially over interval-biased
search trees, which need time $O(\log|x - y| - \log D(x,S))$ [6].

We remark that, combining the first two results, we show that predecessor search can be
performed in time $O(\log\min\{\log d(x,S),\, w - \log D(x,S)\})$ using $O(n \log w)$ bits of additional
space. Our results are obtained starting from a refined version of fat binary search in a
z-fast trie [2] in which the initial search interval can be specified under suitable conditions,
confirming the intuition that fat binary search can be used as a very versatile building block
for data structures.

2 Notation and tools 

We use von Neumann's definition and notation for natural numbers, and identify
$n = \{0, 1, \dots, n-1\}$, so $2 = \{0, 1\}$ and $2^*$ is the set of all binary strings. If $x$ is a
string, $x$ juxtaposed with an interval is the substring of $x$ with those indices (starting from 0).
Thus, for instance, $x[a..b)$ is the substring of $x$ starting at position $a$ (inclusive) and ending at
position $b$ (exclusive). We will write $x[a]$ for $x[a..a]$. The symbol $\preceq$ denotes prefix order,
and $\prec$ is its strict version. Given a prefix $p$, we denote with $p+1$ and $p-1$ the strings in $2^{|p|}$
that come immediately after and before $p$, respectively, in lexicographical order; in case they do
not exist, we assume by convention that the expressions have value $\bot$. All logarithms in this
paper are binary and we postulate that $\log x = 1$ whenever $x < 2$.

Given a set $S$ of $n$ binary strings of length $w$, we let

$x^- = \max\{y \in S \mid y < x\}$ (the predecessor of $x$ in $S$)

$x^+ = \min\{y \in S \mid y \ge x\}$ (the successor of $x$ in $S$),

where $<$ is the lexicographic order. A predecessor/successor query is given by a string $x$,
and the answer is $x^\pm$. In this paper, for the sake of simplicity we shall actually concentrate
on predecessor search only, also because our algorithms actually return the rank of the
predecessor in $S$, and thus are in principle more informative (e.g., the successor can be
immediately computed by adding one to the returned index).

² The bound in [6] is $O(w - \log(x^+ - x^-))$. Our proofs are correct even replacing $D(x,S)$ with $x^+ - x^-$,
but the difference is immaterial as $x^+ - x^- \le 2D(x,S)$, and we like the duality with the previous bound
better.

Figure 1: (above) A compacted trie, the related names, and the function $T$ of the associated
z-fast trie. The skip interval for $\alpha$ is $[7..13]$. Dashed lines show the end of the handles of
internal nodes.

We assume that we can store a constant-time $r$-bit function on $n$ keys using $rn + cn + o(n)$
bits for some constant $c \ge 0$: the function may return arbitrary values outside of its domain
(for practical implementations see [3]).

We work in the standard RAM model with a word of length $w$, allowing multiplications,
and adopt the full randomness assumption. Note, however, that the dependence on
multiplication and full randomness is only due to the need to store functions succinctly; for the
rest, our algorithms do not depend on them.

2.1 Z-fast tries 

We start by defining some basic notation for compacted tries. Consider the compacted trie [5]
associated with a prefix-free set of strings³ $S \subseteq 2^*$. Given a node $\alpha$ of the trie (see Figure 1):

• the extent of $\alpha$, denoted by $e_\alpha$, is the longest common prefix of the strings represented
by the external nodes that are descendants of $\alpha$ (extents of internal nodes are called
internal extents);

• the compacted path of $\alpha$, denoted by $c_\alpha$, is the string labelling $\alpha$;

• the name of $\alpha$, denoted by $n_\alpha$, is the extent of $\alpha$ deprived of its suffix $c_\alpha$;

• the skip interval of $\alpha$ is $[1..|c_\alpha|]$ for the root, and $[|n_\alpha|..|e_\alpha|]$ for all other nodes.

Given a string $x$, we let $\mathrm{exit}(x)$ be the exit node of $x$, that is, the only node $\alpha$ such that
$n_\alpha$ is a prefix of $x$ and either $e_\alpha = x$ or $e_\alpha$ is not a prefix of $x$. We recall a key definition
from [2]:

³ Although the results of this paper are discussed for sets of strings of length $w$, this section provides results
for arbitrary sets of prefix-free strings whose length is $O(w)$.

Definition 1 (2-fattest numbers and handles) The 2-fattest number of an interval $(a..b)$
of positive integers is the unique integer in $(a..b)$ that is divisible by the largest power of two,
or equivalently, that has the largest number of trailing zeroes in its binary representation. The
handle $h_\alpha$ of a node $\alpha$ is the prefix of $e_\alpha$ whose length is the 2-fattest number in the skip interval
of $\alpha$ (see Figure 1). If the skip interval is empty (which can only happen at the root) we
define the handle to be the empty string.

We remark that if $f$ is 2-fattest in $(a..b)$, it is also 2-fattest in every subinterval of $(a..b)$
that still contains $f$.

Definition 2 (z-fast trie) Given a prefix-free set $S \subseteq 2^*$, the z-fast trie on $S$ is a function
$T$ mapping $h_\alpha \mapsto e_\alpha$ for each internal node $\alpha$ of the compacted trie associated with $S$, and
any other string to an arbitrary internal extent.

The most important property of $T$ is that it lets us find very quickly the name
of the exit node of a string $x$ using a fat binary search (Algorithm 1). The basic idea is that
of locating the longest internal extent $e$ that is a proper prefix of $x$: the name of $\mathrm{exit}(x)$
is then $x[0..|e|+1)$. The algorithm narrows down an initial search interval by splitting it
on its 2-fattest number (rather than on its midpoint). The version reported here (which
builds upon [5]) has two main features: very weak requirements on $T$, and the possibility of
starting the search on a small interval. The latter feature will be the key to obtaining our
main results.



Algorithm 1 Fat binary search on the z-fast trie: at the end of the execution we return the
name of $\mathrm{exit}(x)$.

Input: a nonempty string $x \in 2^*$, an integer $0 \le a < |x|$ such that $a = 0$ or $x[0..a)$ is an
internal extent of the compacted trie on $S$, and an integer $b \le |x|$ larger than the length of
the longest internal extent of the compacted trie on $S$ that is a proper prefix of $x$.
Output: the name of $\mathrm{exit}(x)$

while $b - a > 1$ do
    $f \leftarrow$ the 2-fattest number in $(a..b)$
    $e \leftarrow T(x[0..f))$
    if $f \le |e| \wedge e \prec x$ then $a \leftarrow |e|$    { Move from $(a..b)$ to $(|e|..b)$ }
    else $b \leftarrow f$    { Move from $(a..b)$ to $(a..f)$ }
od
if $a = 0 \wedge e_{\mathrm{root}} \ne \varepsilon$ return $\varepsilon$
else return $x[0..a+1)$
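The following Python sketch illustrates Algorithm 1 on a toy instance. It is only a sketch under stated assumptions: keys are equal-length strings over `"0"`/`"1"` (at least two distinct keys), the function $T$ is backed by a plain dictionary rather than a succinct structure, and every non-handle is mapped to the root extent as its "arbitrary internal extent". All names (`build_nodes`, `build_T`, `exit_name`, `exit_name_slow`, `check_all`) are hypothetical and do not come from the paper.

```python
from itertools import product

def two_fattest(l, r):
    """2-fattest number in the half-open interval (l..r], for 0 <= l < r."""
    return (-1 << ((l ^ r).bit_length() - 1)) & r

def common_prefix(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]

def build_nodes(keys):
    """List (name, extent, is_internal) for each node of the compacted trie."""
    nodes = []
    def visit(group, name_len):             # group: sorted keys below the node
        extent = common_prefix(group[0], group[-1])
        nodes.append((extent[:name_len], extent, len(group) > 1))
        if len(group) > 1:
            k = len(extent)                 # the children branch on bit k
            visit([s for s in group if s[k] == "0"], k + 1)
            visit([s for s in group if s[k] == "1"], k + 1)
    visit(sorted(keys), 0)
    return nodes

def build_T(nodes):
    """Dictionary-backed T: handle -> extent; any other string -> root extent."""
    root_extent, table = nodes[0][1], {}
    for name, extent, internal in nodes:
        if internal:
            lo = max(len(name), 1)          # the skip interval is [lo..|extent|]
            h = two_fattest(lo - 1, len(extent)) if lo <= len(extent) else 0
            table[extent[:h]] = extent
    return (lambda s: table.get(s, root_extent)), root_extent

def exit_name(x, T, root_extent, a=0, b=None):
    """Algorithm 1: name of exit(x), searching lengths in the open interval (a..b)."""
    b = len(x) if b is None else b
    while b - a > 1:
        f = two_fattest(a, b - 1)           # 2-fattest number in (a..b)
        e = T(x[:f])
        if f <= len(e) and len(e) < len(x) and x.startswith(e):
            a = len(e)                      # e is an internal extent and e < x
        else:
            b = f
    if a == 0 and root_extent != "":
        return ""                           # x exits at the root
    return x[:a + 1]

def exit_name_slow(x, nodes):
    """Reference implementation straight from the definition of exit(x)."""
    for name, extent, _ in nodes:
        if x.startswith(name) and (extent == x or not x.startswith(extent)):
            return name

def check_all(keys):
    """Compare the two implementations on every string of the universe."""
    nodes = build_nodes(keys)
    T, root_extent = build_T(nodes)
    return all(exit_name(x, T, root_extent) == exit_name_slow(x, nodes)
               for x in map("".join, product("01", repeat=len(keys[0]))))
```

Passing a larger `a` (the length of a known internal extent that is a prefix of `x`) or a smaller `b` shortens the loop, which is the feature the later sections rely on.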



Lemma 1 Let $p_0 = \varepsilon$ and $p_1, p_2, \dots, p_t$ be the internal extents of the compacted trie that
are proper prefixes of $x$, ordered by increasing length. Let $(a..b)$ be the interval maintained
by Algorithm 1. Before and after each iteration the following invariants are satisfied:

1. $a = |p_j|$ for some $j$;

2. $|p_t| < b$.

Thus, at the end of the loop, $a = |p_t|$.

Proof. (1) The fact that $a = |p_j|$ for some $j$ is true at the beginning, and when $a$ is reassigned
(say, $a \leftarrow |e|$) it remains true: indeed, since $e$ is an internal extent, $a < f \le |e|$ and $e \prec x$,
$e = p_k$ for some $k > j$.

(2) By (1), $a$ is always the length of some $p_j$, and $b > |p_t|$ at the beginning, and then it can
only decrease; thus, $(a..b)$ contains the concatenation of some contiguous skip intervals of
the proper ancestors of $\mathrm{exit}(x)$ up to the skip interval of $\mathrm{exit}(x)$ (which may or may not be
partially included itself).

Now, assume by contradiction that when we update $b$ there is a node $\alpha$ with extent $e_\alpha$
which is a proper prefix of $x$ of length $f$ or greater. Since $f$ is 2-fattest in $(a..b)$, it would
be 2-fattest in the skip interval of $\alpha$ (as the latter is contained in $(a..b)$), so $x[0..f)$ would
be the handle of $\alpha$, and $T$ would have returned $e_\alpha$, which satisfies $f \le |e_\alpha|$ and $e_\alpha \prec x$,
contradicting the fact that we are updating $b$. We conclude that the invariant $|p_t| < b$ is
preserved. ∎

Theorem 1 Algorithm 1 completes in at most $\lceil \log(b - a) \rceil$ iterations, returning the name of
$\mathrm{exit}(x)$.

Proof. We first prove the bound on the number of iterations. Note that given an interval
$(\ell..r)$ in which there is at most one multiple of $2^i$, the two subintervals $(\ell..f)$ and $(f..r)$,
where $f$ is the 2-fattest number in $(\ell..r)$, both contain at most one multiple of $2^{i-1}$ (if one
of the intervals contained two such multiples, there would be a multiple of $2^i$ in between,
contradicting our assumption); this observation is a fortiori true if we further shorten the
intervals. Thus, we cannot split on a 2-fattest number more than $i$ times, because at that
point the condition implies that the interval has length at most one. But clearly an interval
of length $t$ contains at most one multiple of $2^{\lceil \log t \rceil}$, which shows that the algorithm iterates
no more than $\lceil \log(b - a) \rceil$ times.

Finally, if $t > 0$ then $x[0..|p_t|+1)$ is the name of $\mathrm{exit}(x)$. Otherwise, $\mathrm{exit}(x)$ is the root
(hence the special case in Algorithm 1). ∎

Note that finding the 2-fattest number in an interval requires the computation of the most
significant bit,⁴ but alternatively, starting from the interval $(\ell..r]$, one can set $i = \lceil \log(r - \ell) \rceil$
(this can be computed trivially in time $O(\log(r - \ell))$) and then check, for decreasing $i$, whether
$(-1 \ll i) \mathbin{\&} \ell \ne (-1 \ll i) \mathbin{\&} r$: when the test is satisfied, there is exactly one multiple of $2^i$ in
the interval, namely $f = r \mathbin{\&} (-1 \ll i)$, which is also 2-fattest. This property is preserved by
splitting on $f$ and possibly further shortening the resulting interval (see the first part of the
proof of Theorem 1), so we can just continue decreasing $i$ and testing, which still requires no
more than $\lceil \log(r - \ell) \rceil$ iterations.
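Both ways of computing the 2-fattest number can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, Python's unbounded integers stand in for machine words, and `int.bit_length()` plays the role of the most-significant-bit instruction.

```python
def two_fattest(l, r):
    """2-fattest number in (l..r], 0 <= l < r, via the most significant bit of l XOR r."""
    return (-1 << ((l ^ r).bit_length() - 1)) & r

def two_fattest_scan(l, r):
    """Same result without an msb instruction: try decreasing exponents i."""
    i = (r - l - 1).bit_length()            # ceil(log2(r - l)) for r > l
    while ((-1 << i) & l) == ((-1 << i) & r):
        i -= 1                              # no multiple of 2^i in (l..r] yet
    return r & (-1 << i)

def two_fattest_slow(l, r):
    """Reference: the element of (l..r] divisible by the largest power of two."""
    return max(range(l + 1, r + 1), key=lambda k: k & -k)

print(two_fattest(7, 13))                   # 8
```

For the open interval $(a..b)$ used by Algorithm 1 one would call `two_fattest(a, b - 1)`.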

2.2 Implementing the function T 

A z-fast trie (i.e., the function $T$ defining it) can be implemented in different ways; in
particular, for the purpose of this paper, we show that if constant-time access to the elements of $S$
in sorted order is available, then the function $T$ describing a z-fast trie can be implemented
using additional $O(n \log w)$ bits. We will use the notation $S[i]$ ($0 \le i < |S|$) for the $i$-th
element of $S$. We need two key components:

1. a constant-time function $g$ mapping the handles to the length of the name of the node
they are associated with (i.e., $h_\alpha \mapsto |n_\alpha|$ for every internal node $\alpha$);

2. a range locator: a data structure that, given the name of a node $n_\alpha$, returns the interval
of keys that are prefixed by $n_\alpha$; more precisely, it returns the smallest ($\mathrm{left}(n_\alpha)$) and
largest index ($\mathrm{right}(n_\alpha)$, respectively) in $S$ of the set of strings prefixed by $n_\alpha$.

⁴ More precisely, the 2-fattest number in $(\ell..r]$ is $(-1 \ll \mathrm{msb}(\ell \oplus r)) \mathbin{\&} r$.

The function $g$ can be implemented in constant time using $O(n \log w)$ bits, and there are
constant-time range locators using $O(n \log w)$ bits [2].

Now, to compute $T(h)$ for a given handle $h$, we consider the candidate node name
$p = h[0..g(h))$ and return the longest common prefix of $S[\mathrm{left}(p)]$ and $S[\mathrm{right}(p)]$. If $h$ is actually
a handle, the whole procedure clearly succeeds and we obtain the required information;
otherwise, we will be returning some unpredictable internal extent (unless $\mathrm{left}(p) = \mathrm{right}(p)$,
but this case can be easily fixed). Summing up,

Theorem 2 If access to the set $S$ is available, the z-fast trie can be implemented in constant
time using additional $O(n \log w)$ bits of space.

This function enjoys the additional property that, whatever the input, it will always
return an extent. We also notice that using the same data it is easy to implement a
function that returns a node extent given a node name:

Definition 3 (extent) Let $p$ be a node name. Then $\mathrm{extent}(p)$ (the extent of the node named
$p$) can be computed in constant time as the longest common prefix of $S[\mathrm{left}(p)]$ and $S[\mathrm{right}(p)]$.
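A minimal sketch of this construction is shown below, with loud caveats: the range locator is replaced by a `bisect`-based stand-in that takes logarithmic rather than constant time, `g` is an ordinary dictionary, and non-handles are sent to the empty name (so that `T` returns the root extent). Keys are assumed to be equal-length `"0"`/`"1"` strings in a sorted list; all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from os.path import commonprefix

def left(S, p):
    """Smallest index of a key of the sorted list S prefixed by p."""
    return bisect_left(S, p + "0" * (len(S[0]) - len(p)))

def right(S, p):
    """Largest index of a key of the sorted list S prefixed by p."""
    return bisect_right(S, p + "1" * (len(S[0]) - len(p))) - 1

def extent(S, p):
    """Definition 3: longest common prefix of S[left(p)] and S[right(p)]."""
    return commonprefix([S[left(S, p)], S[right(S, p)]])

def make_T(S, g):
    """T(h) = extent(h[0..g(h))); g maps each handle to the length of its node name."""
    def T(h):
        p = h[:g.get(h, 0)]                 # assumption: non-handles map to the root name
        return extent(S, p)
    return T

S = ["0001", "0010", "0111", "1100"]
g = {"": 0, "0": 1, "00": 2}                # handles of the three internal nodes of this toy trie
T = make_T(S, g)
```

In this toy instance `T("00")` is `"00"`, while a non-handle such as `"0110"` yields the root extent `""`, which is still an internal extent as Definition 2 requires.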

2.3 Using the range locator to check prefixes 

Given a set $P \subseteq \mathrm{Pref}(S)$, we want to be able to check in constant time and little space that
a prefix $p$ either belongs to $P$, or is not a prefix of a string in $S$. Assume that we have a
function $f$ defined on $P$ and returning, for each $p \in P$, the length of the name of the exit
node of $p$. Our key observation is that a range locator, combined with access to the array $S$,
can be used to "patch" $f$ so that it returns a special value $\bot$ outside of $\mathrm{Pref}(S)$:

Theorem 3 Let $P \subseteq \mathrm{Pref}(S)$ and $f : P \to \mathbf{N}$ be a constant-time function mapping $p \in P$ to
$|n_{\mathrm{exit}(p)}|$. If access to the set $S$ is available, using an additional constant-time range locator
we can extend $f$ to a constant-time function $f : 2^* \to \mathbf{N} \cup \{\bot\}$ such that $f(p) = |n_{\mathrm{exit}(p)}|$ for
all $p \in P$, and $f(p) = \bot$ for all $p \notin \mathrm{Pref}(S)$.

Proof. To compute the extended $f(p)$ for a $p \in 2^*$ we proceed as follows:

1. we evaluate the stored function on $p$, obtaining the candidate length $t$ of the name of $\mathrm{exit}(p)$;

2. if $t \le |p|$ and $p \preceq \mathrm{extent}(p[0..t))$ we return $t$, otherwise we return $\bot$.

Clearly, if $p \in P$, by definition $t = |n_{\mathrm{exit}(p)}| \le |p|$ and we compute correctly the extent
of $\mathrm{exit}(p)$, so we return $|n_{\mathrm{exit}(p)}|$. On the other hand, if $p \notin \mathrm{Pref}(S)$ it cannot be the
prefix of an element of $S$, so in the last step we certainly return $\bot$. ∎

3 Locally sensitive predecessor search 

Our purpose is now to combine Theorems 2 and 3 to answer predecessor queries efficiently in a
way that depends on the distance between the query string and its predecessor and successor.
First of all, it is clear that we can easily compute the index of the predecessor of a string if
its exit node is known (e.g., by fat binary search):
Definition 4 (pred, FBS$^-$) Given a string $x$ and the length $t$ of the name of $\mathrm{exit}(x)$, we
define the constant-time function $\mathrm{pred}(x, t)$ as follows:

• if $x \preceq \mathrm{extent}(x[0..t))$, or if the first bit of $x$ at which $x$ and $\mathrm{extent}(x[0..t))$ differ is a
0, the index of the predecessor of $x$ is $\mathrm{left}(\mathrm{exit}(x)) - 1$ (we use the convention that $-1$
is returned if no predecessor exists);

• otherwise, the index of the predecessor of $x$ is $\mathrm{right}(\mathrm{exit}(x))$.

We denote with $\mathrm{FBS}^-(x, a, b)$ the predecessor index computed by running Algorithm 1 (with
inputs $x$, $a$ and $b$) to obtain the name of $\mathrm{exit}(x)$ and then invoking pred.

We remark that the definition above implies that predecessor search (by means of $\mathrm{FBS}^-(x, 0, |x|)$)
is possible in time $O(\log w)$ using an index of $O(n \log w)$ bits.
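The case analysis of Definition 4 can be sketched as follows. The sketch assumes equal-length `"0"`/`"1"` keys in a sorted Python list and uses a `bisect`-based stand-in for the range locator (logarithmic, not constant, time); the length `t` of the name of the exit node is taken as given, exactly as in the definition. Function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from os.path import commonprefix

def left(S, p):
    return bisect_left(S, p + "0" * (len(S[0]) - len(p)))

def right(S, p):
    return bisect_right(S, p + "1" * (len(S[0]) - len(p))) - 1

def extent(S, p):
    return commonprefix([S[left(S, p)], S[right(S, p)]])

def pred(S, x, t):
    """Index of the predecessor of x, given the length t of the name of exit(x)."""
    name = x[:t]
    e = extent(S, name)
    k = len(commonprefix([x, e]))           # first position where x and e differ
    if k == len(x) or x[k] == "0":          # x is a prefix of e, or x branches with a 0
        return left(S, name) - 1            # -1 means "no predecessor"
    return right(S, name)

S = ["0001", "0010", "0111", "1100"]
print(pred(S, "0011", 3))                   # 1, i.e. S[1] = "0010"
```

For the query `"0011"` the exit node is the leaf named `"001"` (extent `"0010"`); the first differing bit of the query is a 1, so the answer is `right("001")`.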

The rest of this section is devoted to making the computation of the predecessor of $x$
more efficient by storing selected prefixes of strings in $S$ to reduce significantly the initial
search interval of Algorithm 1 (i.e., to increase the parameter $a$). It turns out that this
pre-computation phase does dramatically reduce the number of steps required, making them
depend on the distance between the query string $x$ and its predecessor and successor. More
precisely, for a given set $S$ and a string $x$, let us define

$d(x,S) = \min\{x^+ - x,\, x - x^-\}$ and $D(x,S) = \max\{x^+ - x,\, x - x^-\}$;

if only $x^-$ (equivalently for $x^+$) is defined, we let $d(x,S) = D(x,S) = x - x^-$. We call
$d(x,S)$ (respectively, $D(x,S)$) the short distance (long distance) between $x$ and $S$. We will
devise two distinct predecessor algorithms whose performance depends on the short and on
the long distance between the query string and the queried set $S$: both algorithms use the
setup described in Theorem 3 but with a different choice of the function $f : P \to \mathbf{N}$.

Before proceeding with the presentation of the algorithms, it is worth observing the
following lemmata:

Lemma 2 Let $x$ be a string, $j \le w - \log d(x,S)$ and $p = x[0..j)$. Then either $p$ or $p+1$ or
$p-1$ belongs to $\mathrm{Pref}(S)$.

Proof. Suppose that neither $p$ nor $p+1$ nor $p-1$ belongs to $\mathrm{Pref}(S)$; there are $2^{w-j}$ strings
prefixed by $p$ ($x$ being one of them), and the same is true of $p-1$ and $p+1$. So, the element
$y \in S$ that minimises $|y - x|$ (that will be one of $x^-$ or $x^+$) is such that $|y - x| > 2^{w-j}$.
Hence $d(x,S) > 2^{w-j}$, so $j > w - \log d(x,S)$, contradicting the hypothesis. ∎

Lemma 3 Let $x$ be a string; if $p$ is a prefix of $x$ such that $p \in \mathrm{Pref}(S)$ and
$|p| > w - \log D(x,S)$, then $x$ is either smaller or larger than all the elements of $S$ that have $p$ as prefix.

Proof. Suppose that there is some prefix $p \in \mathrm{Pref}(S)$ of $x$ longer than $w - \log D(x,S)$ and
that there are two elements of $S$ having $p$ as prefix and that are smaller and larger than
$x$, respectively; in particular, $p$ is also a prefix of $x^+$ and $x^-$. Since $p$ is the prefix of less
than $2^{\log D(x,S)} = D(x,S)$ strings, $x^+ - x^- < D(x,S)$; but $x^+ - x^- \ge D(x,S)$, so we have a
contradiction. ∎
3.1 Short-distance predecessor algorithm

Our first improvement allows for the computation time to depend on short distances, using
techniques inspired by [7]. To this aim, let us consider the following set of prefixes:

$P = \{\, x[0..w - 2^{2^i}) \mid x \in S \text{ and } i = 0, 1, \dots, \lfloor \log(\log w - 1) \rfloor \,\}$.

To store the function $f : P \to \mathbf{N}$ needed by Theorem 3, we define a subset of $P$:

$Q = \bigcup_{\text{node } \alpha} \min\nolimits_{\preceq} \{\, p \in P \mid n_\alpha \prec p \preceq e_\alpha \,\}$.

In other words, for every node we take the shortest string in $P$ that sits between the name
and the extent of the node (if any). We can map every element $q \in Q$ to $|n_{\mathrm{exit}(q)}|$ in space
$O(n \log w)$ as $|Q| \le n$. Then, we map every $p \in P$ to the smallest $i$ such that $p[0..w - 2^{2^i}) \in Q$.
This map takes $O(n \log\log w \log\log\log w) = O(n \log w)$ bits. To compute $f(p)$, we first
compute the index $i$ using the second map, and then query the first map using $p[0..w - 2^{2^i})$.

Algorithm 2 probes prefixes of decreasing lengths in the set $P$. More precisely, at each
step we will probe a prefix $p$ of length $w - 2^{2^i}$ of the query string $x$; if this probe fails,
then $p+1$ and finally $p-1$ are probed (if they exist). If we succeed in the first case, we have
found a valid prefix of $x$ in the trie, and we can proceed with a fat binary search. Otherwise,
no element is prefixed by $p$, and if an element is prefixed by $p-1$ or $p+1$ we
can easily locate the predecessor of $x$.

Algorithm 2 Short-distance speedup.

Input: a nonempty string $x \in 2^w$
Output: the index $i$ such that $S[i] = x^-$

$i \leftarrow 0$
while $2^{2^i} \le w/2$ do
    $p \leftarrow x[0..w - 2^{2^i})$
    $t \leftarrow f(p)$
    if $t \ne \bot$
        $e \leftarrow \mathrm{extent}(x[0..t))$
        if $e \prec x$ return $\mathrm{FBS}^-(x, |e|, |x|)$    { We found a long extent }
        return $\mathrm{pred}(x, t)$    { We exit at the node of name $x[0..t)$ }
    fi
    $t \leftarrow f(p+1)$
    if $t \ne \bot$ return $\mathrm{left}((p+1)[0..t)) - 1$    { $x^-$ is the predecessor of $p+1$ }
    $t \leftarrow f(p-1)$
    if $t \ne \bot$ return $\mathrm{right}((p-1)[0..t))$    { $x^-$ is the successor of $p-1$ }
    $i \leftarrow i + 1$
od
return $\mathrm{FBS}^-(x, 0, |x|)$    { Standard search (we found no prefix long enough) }



More precisely, it turns out that:

Theorem 4 Algorithm 2 returns the predecessor of $x$ in time $O(\log\log d(x,S))$, and requires
an index of $O(n \log w)$ bits of space (in addition to the space needed to store the elements of
$S$).

Proof. First we show that the algorithm is correct. If we exit at the first return instruction,
$e$ is a valid extent and a prefix of $x$, so we correctly start a fat binary search. At the second
return instruction we know that $x[0..t)$ is the name of a node $\alpha$, but the extent of $\alpha$ is not a prefix
of $x$, so $x$ exits exactly at $\alpha$, and again we return the correct answer. If $p+1$ is a valid prefix
of some element of $S$, but $p$ is not, then the predecessor of $x$ is the predecessor of the least
element prefixed by $p+1$, which we return (analogously for $p-1$).

By Lemma 2 we will hit a prefix in our set $P$ as soon as $w - 2^{2^i} \le w - \log d(x,S)$,
that is, $i \ge \log\log\log d(x,S)$. If $i$ is the smallest integer satisfying the latter condition, then
$i - 1 < \log\log\log d(x,S)$, so $2^{2^i} < (\log d(x,S))^2$, which guarantees that the fat binary search,
which starts from an extent of length $|e| \ge |p| = w - 2^{2^i} > w - (\log d(x,S))^2$, will
complete in time $O(\log(b - a)) = O(\log\log d(x,S))$ (see Theorem 1). If we exit from the loop,
it means that $i \ge \log\log\log d(x,S)$ implies $2^{2^i} > w/2$, hence $(\log d(x,S))^2 > w/2$, so the last
fat binary search (that takes $O(\log w)$ steps to complete) is still within our time bounds. ∎
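As a rough numerical illustration of this bound, the chain of inequalities in the proof can be instantiated with concrete values. The values $w = 64$ and $d(x,S) = 2^8$ are assumptions chosen only for this example; they do not come from the paper.

```latex
\begin{align*}
w &= 64, \qquad d(x,S) = 2^{8} \;\Rightarrow\; \log d(x,S) = 8,\\
2^{2^{i}} &\ge \log d(x,S) = 8 \iff i \ge \log\log 8 \approx 1.58
  \;\Rightarrow\; \text{the smallest such } i \text{ is } 2,\\
2^{2^{2}} &= 16 < (\log d(x,S))^{2} = 64,\\
b - a &\le w - \bigl(w - 2^{2^{2}}\bigr) = 16
  \;\Rightarrow\; \lceil \log(b - a) \rceil \le 4 .
\end{align*}
```

Under these assumptions a probe is guaranteed to succeed by the third iteration ($i = 0, 1, 2$) at the latest, and the subsequent fat binary search needs at most 4 steps, compared with the $\log w = 6$ steps of a search started from the whole interval $(0..w)$.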

3.2 Long-distance predecessor algorithm 

We now discuss Algorithm 3, whose running time depends on long distances. Let $P$ be the
set obtained by "cutting" every internal extent $e_\alpha$ to the length of the smallest power of 2
(if any) in the skip interval of $\alpha$; more precisely:

$P = \bigcup_{\alpha \text{ internal}} \{\, e_\alpha[0..2^k) \mid 2^k \in [|n_\alpha|..|e_\alpha|] \text{ and } k \text{ is the smallest possible} \,\}$,

where $\alpha$ ranges over all internal nodes. Since this time we have at most one prefix per node,
$|P| = O(n)$, so the function $f$ required by Theorem 3 can be stored in $O(n \log w)$ bits.

Algorithm 3 keeps track of the length $a$ of an internal extent that is known to be a prefix
of $x$. At each step, we try to find another extent by probing a prefix of $x$ whose length is the
smallest power of two larger than $a$. Because of the way the set $P$ has been built, we can
miss the longest prefix length at most by a factor of two.

Theorem 5 Algorithm 3 returns the predecessor of an input string $x$ in time
$O(\log(w - \log D(x,S)))$, and requires an index of $O(n \log w)$ bits of space (in addition to the space
needed to store the elements of $S$).

Proof. First we show that the algorithm is correct. It can be easily seen that at each step
$a$ is either 0 or the length of an internal extent that is a prefix of $x$. Moreover, if there is an
internal extent of length at least $m$ that is a prefix of $x$, then $t \ne \bot$, so if we exit at the
first return instruction, the fat binary search completes correctly. If $t \ne \bot$, we know that
$x[0..t)$ is the name of a node $\alpha$ (because $(a..w)$ is a union of consecutive skip intervals, and
the smallest power of two in such $(a..w)$ is a fortiori the smallest power of two in a skip
interval): if $x$ is smaller than the smallest leaf under $\alpha$ (or larger than the largest such leaf),
we immediately know the predecessor and can safely return with a correct value. The return
instruction at the exit of the loop is trivially correct.

Observe that when $m > w - \log D(x,S)$ either the string $x[0..m)$ will not be in $\mathrm{Pref}(S)$,
and thus $t = \bot$, or (because of Lemma 3) $x$ will be larger (or smaller) than every element of
$S$ prefixed by $x[0..t)$, which will cause the loop to be interrupted at one of the last two if
instructions. Since $m$ gets at least doubled at each iteration, this condition will take place in
at most $\log(w - \log D(x,S))$ iterations; moreover, $m \le 2a$ (because there is always a power
of 2 in the interval $(a..2a]$), so the fat binary search in the first return will take no more
than $\log(m - a) \le \log a \le \log(w - \log D(x,S))$. If the loop exits naturally, then there is a
prefix of $x$ belonging to $\mathrm{Pref}(S)$ and longer than $w/2$, hence $w - \log D(x,S) \ge w/2$ and the
fat binary search at the end of the loop will end within the prescribed time bounds. ∎



Algorithm 3 Long-distance speedup.

Input: a nonempty string $x \in 2^w$
Output: the index $i$ such that $S[i] = x^-$

$a \leftarrow 0$
while $a < w/2$ do
    $m \leftarrow$ least power of 2 in $(a..w)$
    $t \leftarrow f(x[0..m))$
    if $t = \bot$ return $\mathrm{FBS}^-(x, a, m)$    { We obtained the longest possible prefix }
    $p \leftarrow x[0..t)$
    if $S[\mathrm{left}(p)] > x$ return $\mathrm{left}(p) - 1$
    if $S[\mathrm{right}(p)] < x$ return $\mathrm{right}(p)$
    $a \leftarrow |\mathrm{extent}(p)|$    { This is a valid extent }
od
return $\mathrm{FBS}^-(x, a, w)$
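The control flow of Algorithm 3 (doubling probes, early exits, final search on a short interval) can be illustrated with the Python sketch below. It is not the paper's data structure, and the substitutions are deliberate: the patched function $f$ is replaced by an exact test of membership in $\mathrm{Pref}(S)$ obtained with `bisect` on the sorted list, the probed prefix itself is used in place of the node name $x[0..t)$, and the fat binary search $\mathrm{FBS}^-$ is replaced by an ordinary binary search on the prefix length (helper `finish`). Keys are assumed to be equal-length `"0"`/`"1"` strings; all names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from os.path import commonprefix

def prefix_range(S, p):
    """Indices (lo, hi) of the keys of the sorted list S prefixed by p; lo > hi if none."""
    pad = len(S[0]) - len(p)
    return bisect_left(S, p + "0" * pad), bisect_right(S, p + "1" * pad) - 1

def finish(S, x, a, b):
    """Stand-in for FBS^-(x, a, b): binary search on the prefix length in [a..b]."""
    while a < b:                            # invariant: x[:a] is a prefix of some key
        mid = (a + b + 1) // 2
        lo, hi = prefix_range(S, x[:mid])
        if lo <= hi:
            a = mid
        else:
            b = mid - 1
    lo, hi = prefix_range(S, x[:a])
    if a == len(x) or x[a] == "0":          # x is a key, or every key under x[:a] is larger
        return lo - 1
    return hi                               # every key under x[:a] is smaller than x

def predecessor_index(S, x):
    """Index of the largest key smaller than x (or -1), probing powers of two."""
    w, a = len(x), 0
    while a < w / 2:
        m = 1 << a.bit_length()             # least power of 2 greater than a
        lo, hi = prefix_range(S, x[:m])
        if lo > hi:                         # x[:m] is not in Pref(S)
            return finish(S, x, a, m)
        if S[lo] >= x:
            return lo - 1
        if S[hi] < x:
            return hi
        a = len(commonprefix([S[lo], S[hi]]))   # a longer extent that is a prefix of x
    return finish(S, x, a, w)
```

Since each probe at least doubles `a`, queries that are far from their neighbours leave the loop after a few probes, and the final search only has to cover the interval between the last successful and the first failed probe.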



Finally, we can combine our improvements for short and long distances, obtaining an
algorithm that is efficient when the input $x$ is either very close to or very far from $x^-$ or $x^+$:

Corollary 1 It is possible to compute the predecessor of a string $x$ in a set $S$ in time
$O(\log\min\{\log d(x,S),\, w - \log D(x,S)\})$, using an index that requires $O(n \log w)$ bits of space
(in addition to the space needed to store the elements of $S$).

4 Globally sensitive predecessor search 

We can apply Theorem 5 to improve exponentially over the bound described in [8], which
gives an algorithm whose running time depends on the largest and smallest distance between
the elements of $S$. More precisely, let $\Delta_M$ and $\Delta_m$ be, respectively, the largest and smallest
distance between two consecutive elements of $S$.

Corollary 2 Using an index of $O(n \log w)$ bits, it is possible to answer predecessor queries
in time $O(\log\log(\Delta_M/\Delta_m))$.

Proof. See the appendix. ∎

5 Finger predecessor search 

We conclude with a generalisation of long-distance search that builds on previous results [4].
Using $O(n w^{1/c})$ bits (for any $c$) it is possible to answer weak prefix search queries in constant
time. A weak prefix search query takes a prefix $p$ and returns the leftmost and rightmost
index of elements of $S$ that are prefixed by $p$; if no such element exists, the results are
unpredictable (hence the "weak" qualifier), but a single access to the set $S$ is sufficient
to rule out this case and always get a correct result. Thus, we will be able to compute
$\mathrm{left}(\cdot)$, $\mathrm{right}(\cdot)$ and $\mathrm{extent}(\cdot)$ on arbitrary elements of $\mathrm{Pref}(S)$ in constant time. As a
consequence, $\mathrm{pred}(x, t)$ can also be extended so as to return a correct value for every $t$ such that
$x[0..t) \in \mathrm{Pref}(S)$.

The basic idea of Algorithm 4 is that of using a finger $y \in S$ to locate quickly an extent $e$
that is a prefix of $x$ with the guarantee that $w - |e| \le \log|x - y|$. The extent is then used to
accelerate an algorithm essentially identical to Algorithm 3, but applied to a reduced universe
(the strings starting with $e$); the running time thus becomes
$O(\log(w - |e| - \log D(x,S))) = O(\log(\log|x - y| - \log D(x,S)))$.



Algorithm 4 Long-distance finger-search speedup.

Input: a nonempty string $x \in 2^w$ and a $y \in S$ such that $y < x$
Output: the index $i$ such that $S[i] = x^-$

$t \leftarrow \max\{\, s \mid y[0..s) + 1 \preceq x \,\}$
$e \leftarrow \mathrm{extent}(y[0..t) + 1)$
if $y[0..t) + 1 \not\preceq e$ return $\mathrm{right}(y[0..t))$    { $y[0..t) + 1 \notin \mathrm{Pref}(S)$ }
if $e \not\prec x$ return $\mathrm{pred}(x, t)$    { $x$ exits between $y[0..t) + 1$ and $e$ }
$a \leftarrow 0$    { Now $e \prec x$ and $w - |e| \le \log|x - y|$ }
while $a < (w - |e|)/2$ do
    $m \leftarrow$ least power of 2 in $(a..w - |e|)$
    $p \leftarrow x[0..m + |e|)$
    if $p \notin \mathrm{Pref}(S)$ return $\mathrm{FBS}^-(x, a + |e|, m + |e|)$
    if $S[\mathrm{left}(p)] > x$ return $\mathrm{left}(p) - 1$
    if $S[\mathrm{right}(p)] < x$ return $\mathrm{right}(p)$
    $a \leftarrow |\mathrm{extent}(p)| - |e|$    { This is a valid extent }
od
return $\mathrm{FBS}^-(x, a + |e|, w)$



Theorem 6 Algorithm 4 returns the predecessor of an input string $x$ given a finger $y \in S$,
with $y < x$, in time $O(\log(\log|x - y| - \log D(x,S)))$ using an index of $O(n w^{1/c})$ bits of space,
for any $c$ (in addition to the space needed to store the elements of $S$).

Proof. See the appendix. ∎

References 

[1] Andersson, A., Thorup, M.: Dynamic ordered sets with exponential search trees. J.
Assoc. Comput. Mach. 54(3), 1-40 (2007)

[2] Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Monotone minimal perfect hashing:
Searching a sorted table with O(1) accesses. In: SODA '09. pp. 785-794. ACM Press
(2009)

[3] Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Theory and practise of monotone
minimal perfect hashing. In: ALENEX 2009. pp. 132-144. SIAM (2009)

[4] Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with
applications. In: de Berg, M., Meyer, U. (eds.) Algorithms - ESA 2010. Lecture Notes
in Computer Science, vol. 6346, pp. 427-438. Springer (2010)

[5] Belazzougui, D., Boldi, P., Vigna, S.: Dynamic z-fast tries. In: Chavez, E., Lonardi, S.
(eds.) SPIRE 2010. Lecture Notes in Computer Science, vol. 6393, pp. 159-172. Springer
(2010)

[6] Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random
access to grammar compressed strings. In: SODA '11 (2011)

[7] Bose, P., Douïeb, K., Dujmović, V., Howat, J., Morin, P.: Fast local searches and
updates in bounded universes. In: CCCG 2010. pp. 261-264 (2010)

[8] Demaine, E.D., Jones, T.R., Pătraşcu, M.: Interpolation search for non-independent
data. In: Munro, J.I. (ed.) SODA '04. pp. 529-530 (2004)

[9] Knuth, D.E.: The Art of Computer Programming. Addison-Wesley (1973)

[10] Pătraşcu, M.: Lower bound techniques for data structures. Ph.D. thesis, Massachusetts
Institute of Technology, Dept. of Electrical Engineering and Computer Science (2008)

[11] Pătraşcu, M., Thorup, M.: Time-space trade-offs for predecessor search. In: STOC '06.
pp. 232-240. ACM Press (2006)

[12] Pătraşcu, M., Thorup, M.: Randomization does not help searching predecessors. In:
SODA '07. pp. 555-564. SIAM, Philadelphia, PA, USA (2007)

[13] Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Θ(N).
Inform. Process. Lett. 17(2), 81-84 (1983)

Appendix 

Proof (of Corollary 2). We use a standard "universe reduction" argument, splitting the
universe $2^w$ by grouping strings sharing the most significant $\lceil \log n \rceil$ bits. Each subuniverse
$U_i$ has size $2^{w - \lceil \log n \rceil} = O(2^w/n)$, and we let $S_i = S \cap U_i$. Using a constant-time prefix-sum
data structure we keep track of the rank in $S$ of the smallest element of $S_i$, and we build
the indices that are necessary for Algorithm 3 for each $S_i$ (seen as a set of strings of length
$w - \lceil \log n \rceil$). Thus, we can answer a query $x$ in time $O(\log(w - \lceil \log n \rceil - \log D(x, S_i)))$, where
$U_i$ is the subuniverse containing $x$. Now note that $\Delta_M \ge 2^w/n$, and that
$\Delta_m \le x^+ - x^- = (x^+ - x) + (x - x^-) \le 2D(x,S) \le 2D(x,S_i)$ (unless $x$ is the smallest or the
largest element of $S_i$, but this case can be dealt with in constant time). The bound follows immediately. ∎
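The last step can be spelled out as follows; this is only a worked restatement that takes the two inequalities stated in the proof ($\Delta_M \ge 2^w/n$ and $\Delta_m \le 2D(x, S_i)$) as given.

```latex
\begin{align*}
w - \lceil \log n \rceil - \log D(x, S_i)
  &\le \log\frac{2^{w}}{n} - \log\frac{\Delta_m}{2}
     && \text{since } \Delta_m \le 2D(x, S_i) \\
  &\le \log \Delta_M - \log \Delta_m + 1
     && \text{since } \Delta_M \ge 2^{w}/n \\
  &= \log\frac{\Delta_M}{\Delta_m} + 1 .
\end{align*}
```

Taking logarithms once more, and recalling the convention $\log x = 1$ for $x < 2$, gives $O(\log(w - \lceil \log n \rceil - \log D(x, S_i))) = O(\log\log(\Delta_M/\Delta_m))$.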

Proof (of Theorem 6). First we show that the algorithm is correct. If we exit at the first
return instruction, $y[0..t) + 1$ is not in $\mathrm{Pref}(S)$, which implies that $x^-$ is prefixed by $y[0..t)$,
and thus the output is correct. If we exit at the second return instruction, $x$ exits at the
same node as $y[0..t) + 1 = x[0..t)$. Otherwise, $e$ is an extent that is a proper prefix of $x$,
and the remaining part of the algorithm is exactly Algorithm 3 applied to the set of strings
of $S$ that are prefixed by $e$, with $e$ removed (the algorithm is slightly simplified by the fact
that we can test membership to $\mathrm{Pref}(S)$ and compute extents for every prefix). Correctness
is thus immediate.

All operations are constant time, except for the last loop. Note that as soon as
$m + |e| > w - \log D(x,S)$ the loop ends or a prefix of $x$ is found (as in the analysis of Algorithm 3), and
this requires no more than $\log(w - |e| - \log D(x,S))$ iterations; moreover, $m \le 2a$ (because
there is always a power of 2 in the interval $(a..2a]$), so the fat binary search in the first
return will take no more than $\log(m - a) \le \log a \le \log(w - |e| - \log D(x,S))$. If the loop
exits naturally, then there is a prefix $e'$ of $x$ belonging to $\mathrm{Pref}(S)$ and longer than $(w + |e|)/2$,
hence by Lemma 3 $w - \log D(x,S) \ge (w + |e|)/2$; the fat binary search at the end takes time
$O(\log(w - (a + |e|))) = O(\log(w - |e'|)) = O(\log(w/2 - |e|/2)) = O(\log(w - |e| - \log D(x,S)))$,
within the prescribed time bounds. ∎